ABSTRACT

A major fraction of the transcriptome of higher organisms
comprises an extensive repertoire of long non-coding RNAs
(lncRNAs), which are expressed in a cell type- and developmental
stage-specific manner. While lncRNAs are a proven component of
epigenetic gene expression modulation, the epigenetic regulation
of lncRNAs themselves remains poorly understood. Here we have
analysed pan-genomic DNA methylation and histone modification
marks (H3K4me3, H3K9me3, H3K27me3 and H3K36me3) associated with
the transcription start site (TSS) of lncRNA genes in four
different cell types and three different tissue types
representing various cellular stages. We observe that the histone
marks associated with active transcription, H3K4me3 and H3K36me3,
along with the repressive histone mark H3K27me3, have similar
distribution patterns around the TSS irrespective of cell type.
Also, the density of these marks correlates well with expression
of protein-coding and lncRNA genes. In contrast, the lncRNA genes
harbour higher methylation density around the TSS than
protein-coding genes regardless of their expression status.
Furthermore, we found that DNA methylation, along with the other
repressive histone mark H3K9me3, does not seem to play a role in
lncRNA expression. Thus, our observations suggest that epigenetic
regulation of lncRNAs shares common features with that of mRNAs,
except for the role of DNA methylation, which is markedly
dissimilar.

INTRODUCTION 

The outcome of the ENCODE project and subsequent studies have
revealed that the majority of eukaryotic transcripts do not code
for proteins (1). Such non-coding RNAs (ncRNAs) had been reported
previously but were generally accepted to be transcriptional
noise and/or experimental artefacts (2). However, it has now been
established that expression of ncRNAs is cell- and developmental
stage-specific, with a strong association between aberrant
expression and the manifestation of disease (3-7). A greater
degree of evolutionary complexity has been linked to a
concomitant increase in ncRNA diversity, which suggests that
ncRNAs are under evolutionary selection and therefore should
critically affect cell and hence organism identity (8,9). ncRNAs
have diverse functions and are key intermediaries in chromatin
organization and gene regulation (10-15).

Recent genome-scale transcriptome maps have revealed that a
significant subset of these transcripts forms a distinct class of
ncRNAs, presently known as long non-coding RNAs (lncRNAs). Though
the molecular basis of the function of many lncRNAs is just
emerging, the present understanding indicates their intricate
roles in the regulation of a wide variety of biological processes
(16). Some lncRNAs are conserved in mammals, though conservation
is not a general rule for this class (17). LncRNAs have been
reported to affect chromatin both near their loci of expression
(cis) and at genomic regions distant from their loci of
expression (trans) (18). A large number of mammalian lncRNAs are
increasingly being recognized as key regulators of chromatin
organization, mediating important biological processes such as
X-chromosome inactivation (Xist), imprinting (Kcnq1ot1) and gene
expression at the transcriptional level (Hotair) (12,19-22).
Several lncRNAs modulate chromatin structure by recruiting the
polycomb group of proteins to their target sites, resulting in
histone H3 lysine 27 methylation-induced silencing (23). Although
a huge number of lncRNAs have been identified in genome-wide
transcriptome analyses, little is known regarding the
spatio-temporal regulation of lncRNA expression (24).

Considering that lncRNAs have the potential to regulate the
chromatin state, the transcription of lncRNAs themselves must be
tightly regulated. Similar to protein-coding genes, most lncRNAs
are transcribed by RNA pol II and have typical hallmarks of
pol II-transcribed products, such as a 5' cap and a poly(A) tail
(25). Further, pol II-mediated gene expression is known to be
regulated by epigenetic mechanisms like DNA and histone
modifications (26). Also, since many lncRNAs are expressed in a
cell type/tissue- and developmental stage-specific manner, it is
extremely likely that their own expression is epigenetically
regulated (27).

Epigenetic mechanisms regulating the expression of protein-coding
genes are well characterized. Promoter hypomethylation and
histone modifications (like H3K4me3, H3K27me3, H3K9me3 and
H3K36me3) are some of the epigenetic marks that are understood to
be key regulators of mRNA expression. However, unlike for
protein-coding genes, a systematic analysis of these epigenetic
features in lncRNA genes has not yet been undertaken. Although a
few reports have described epigenetic regulation of specific
lncRNAs, studies of epigenetic patterns at a global scale,
especially in and around the transcription start site (TSS) of
lncRNA genes, are scarce (28,29).

Therefore, in this study we performed a genome-wide analysis of
the distribution of DNA methylation and histone modifications
(H3K4me3, H3K9me3, H3K27me3 and H3K36me3) across the TSS of all
known lncRNA genes in different cell and tissue types. To assess
the effect of these chromatin modifications on gene expression,
we analysed the gene expression data of brain tissue and H1
cells, for which the data were available. Our results suggest
that lncRNAs have histone marks associated with active
transcription in a manner similar to that of the protein-coding
genes, while they differ in repressive marks like DNA methylation
and the H3K9me3 histone modification.



MATERIALS AND METHODS 

Human genome and annotations 

We used the human genome hg19 build
(http://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml)
from the University of California Santa Cruz Genome
Bioinformatics Site (http://genome.ucsc.edu) as the reference for
mapping raw reads (30). RefSeq genes (40 845 in total, of which
30 623 were retrieved from the site; only unique entries were
used for analysis), CpG island (CGI) positions (28 691) and
ORegAnno (http://genome.ucsc.edu) (23 089) datasets were
similarly retrieved from the UCSC Genome Browser for the same
build of the human genome. Datasets for cytosine methylation were
retrieved from methylated DNA immunoprecipitation sequencing
(MeDIP-seq) experiments for four different sets, namely, H1 human
embryonic stem cells (H1), tissue from germinal centre of human
brain (Brain Gr) and IMR90 embryonic lung fibroblast cells from
the NIH Roadmap Epigenomics project (31). Another set of MeDIP
data, for tissue from the frontal cortex region of human brain
(Brain Fr), was downloaded from NCBI-SRA. The raw data for
transcriptome sequencing (mRNA-seq) and histone modification
(H3K4me3) for H1 cells and brain cortical tissue were similarly
retrieved from the NIH Roadmap Epigenomics project and NCBI-SRA,
respectively. The data for histone modifications (H3K4me3,
H3K9me3, H3K27me3 and H3K36me3) of four different cell types (H1,
IMR90, CD34 primary cells and peripheral blood mononuclear cells
(PBMCs)) and two different tissue types were downloaded from the
NIH Roadmap Epigenomics project. For PBMCs, H3K4me3 data were not
available; we therefore downloaded H3K4me1 data and performed the
analysis with this dataset. The genome coordinates of the lncRNA
genes (11 004) and protein-coding genes (20 012) were obtained
from the Gencode website and Ensembl genome browser, respectively
(32,33).

Read mapping and annotation of features 

The raw reads of the MeDIP-seq datasets downloaded for H1 cells
and brain cortical tissue were mapped onto the human genome
reference sequence (hg19 build) using the Burrows-Wheeler
Alignment Tool with default parameters (34). For annotation and
transcript quantification of RNA-seq data from H1 cells and brain
cortical tissue, we used a pipeline comprising Tophat (1.3.3) and
Cufflinks (1.2.0). The rest of the data used were downloaded in
aligned format from the sources mentioned above (35,36).

Analysis of MeDIP datasets 

We used Model-based Analysis of ChIP-Seq (MACS) (version 1.4.0
beta) for peak detection and analysis of immunoprecipitated
sequencing data, to find genomic regions that are enriched in a
pool of specifically precipitated DNA fragments (37). MACS was
run with default parameters on aligned files of methylation data
(H1 cells, PBMCs, brain germinal and cortical tissues) and
histone modification datasets (H1 cells, IMR90 cells, PBMCs, CD34
primary cells, liver tissue and brain germinal centre tissue),
and enriched peaks were generated (Supplementary Table S1).

In-depth analysis and data integration 

In-depth analysis, data integration and comparison were performed
using custom scripts written in Perl. The methylation peak summit
files generated by MACS were used for downstream analysis. Summit
peak files of methylation data and histone data were used to
examine their differential patterns across the TSS of
protein-coding and lncRNA genes. An enriched gene file generated
by Tophat (1.3.3) and Cufflinks (1.2.0) was used for classifying
genes. We used an empirical cutoff of 1 SD from the mean to
classify genes as highly or lowly expressed. Comparison of the
various marks across the TSS of protein-coding and lncRNA genes
was performed using custom scripts.
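
The 1 SD cutoff described above can be illustrated with a short sketch. This is not the authors' Perl code: the function name `classify_by_expression`, the use of the population standard deviation, and the absence of any log transform are all assumptions made for illustration, since the text does not specify them.

```python
from statistics import mean, pstdev


def classify_by_expression(fpkm_by_gene, n_sd=1.0):
    """Label each gene 'high', 'low' or 'mid' relative to mean +/- n_sd * SD.

    fpkm_by_gene: dict mapping gene id -> expression value (e.g. FPKM).
    Assumption: population SD on untransformed values; the paper does not
    state which estimator or scale was used.
    """
    values = list(fpkm_by_gene.values())
    mu = mean(values)
    sd = pstdev(values)
    labels = {}
    for gene, value in fpkm_by_gene.items():
        if value > mu + n_sd * sd:
            labels[gene] = "high"
        elif value < mu - n_sd * sd:
            labels[gene] = "low"
        else:
            labels[gene] = "mid"
    return labels


if __name__ == "__main__":
    # Hypothetical toy values, not data from the study.
    toy = {"geneA": 0.0, "geneB": 10.0, "geneC": 10.0, "geneD": 10.0, "geneE": 20.0}
    print(classify_by_expression(toy))
```

With strongly right-skewed values such as raw FPKM, the lower threshold can fall below zero, so in practice a transformation of the values may be needed before a symmetric cutoff is meaningful.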

To find the co-occurrence of one or more of the epigenetic marks
(methylation and histone modification marks) at the TSS of lncRNA
and protein-coding genes, we calculated the number of these
events falling within ± 2 kb of the TSS in both cell types. To
plot the data we used Venny
(http://bioinfogp.cnb.csic.es/tools/venny/index.html).
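
As a rough illustration of the co-occurrence count, the sketch below collects, for each mark, the set of genes with at least one peak summit within ± 2 kb of the TSS; set intersections then give the overlaps that a Venn diagram would display. The function name `marks_near_tss` and the input layout are hypothetical and are not taken from the authors' scripts.

```python
def marks_near_tss(tss_by_gene, summits_by_mark, window=2000):
    """Return {mark: set of genes with >= 1 summit within +/- window bp of the TSS}.

    tss_by_gene: dict gene id -> (chromosome, TSS position).
    summits_by_mark: dict mark name -> list of (chromosome, summit position).
    """
    marked = {}
    for mark, summits in summits_by_mark.items():
        genes = set()
        for gene, (chrom, tss) in tss_by_gene.items():
            if any(c == chrom and abs(pos - tss) <= window for c, pos in summits):
                genes.add(gene)
        marked[mark] = genes
    return marked


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    tss = {"g1": ("chr1", 10000), "g2": ("chr1", 50000), "g3": ("chr2", 10000)}
    summits = {
        "H3K4me3": [("chr1", 10500), ("chr2", 9000)],
        "H3K27me3": [("chr1", 11900), ("chr1", 53000)],
    }
    sets = marks_near_tss(tss, summits)
    print(sets["H3K4me3"] & sets["H3K27me3"])  # genes carrying both marks
```

The nested loop is quadratic and only suitable for small inputs; a genome-scale run would normally sort summits per chromosome and use binary search or an interval index.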

RESULTS

Distinct patterns of DNA methylation across TSS of
protein-coding and lncRNA genes

Expression of ncRNAs and differential methylation marks are both
components of the tissue differentiation machinery. The
methylation architecture in and around protein-coding genes
affects their expression and hence influences cell identity (38).
However, the role of DNA methylation in the regulation of lncRNA
genes remains unclear. We first compared the average methylation
density within exons, introns and promoters (2 kb upstream of the
TSS) of lncRNA and protein-coding genes from the H1 cell line,
PBMCs, brain cortical tissue and brain germinal matrix tissue. We
found that the methylation density within these regions was
similar for the two gene classes, with exons having higher
methylation density than introns or promoters (Figure 1A and B).
However, the methylation density around the TSS was markedly
different between lncRNA and protein-coding genes in all the cell
and tissue types studied (Figure 2A-D). The average methylation
density around the TSS of protein-coding genes showed a V-shaped
curve indicative of low methylation levels (Figure 2A-D), which
is in concordance with earlier reports (39,40). In contrast, we
did not find the characteristic dip in methylation density at the
TSS of lncRNA genes. Rather, we found an increased methylation
density with a sharp peak immediately downstream of the TSS, in
the region of the first exon of lncRNA genes (Figure 2A-D). This
suggests a differential pattern of methylation across the TSS of
lncRNAs vis-a-vis protein-coding genes, which might be due to a
potential difference in gene regulation across these loci.
Alternatively, the difference in methylation pattern could be due
to partial overlap of some lncRNAs with exons of protein-coding
genes, since we and others have previously demonstrated that
exons of protein-coding genes (coding exons) harbour a higher
methylation density than introns and untranslated regions
(39,41,42). To rule out this possibility, the methylation
densities of lncRNAs that fall within protein-coding genes
(~4000) and those that lie 1 kb up- or downstream of
protein-coding genes (~7000) were analysed separately. In both
cases we found that the methylation patterns were consistent with
the initial analysis of the superset (Supplementary Figure S1).
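
The split of lncRNAs by their position relative to protein-coding genes can be sketched as an interval comparison. The text does not spell out the exact rule, so the three labels below ("overlapping", "proximal", "distal") and the 1 kb `flank` parameter are one possible reading rather than the authors' definition; the function name `classify_lncrna` is hypothetical.

```python
def classify_lncrna(lnc, coding_genes, flank=1000):
    """Classify one lncRNA by its position relative to protein-coding genes.

    lnc and every entry of coding_genes are (chromosome, start, end) tuples
    with start <= end, in closed coordinates.
    Returns 'overlapping' if it intersects any coding gene, 'proximal' if it
    lies within `flank` bp of one without intersecting any, otherwise 'distal'.
    """
    chrom, start, end = lnc
    near = False
    for g_chrom, g_start, g_end in coding_genes:
        if g_chrom != chrom:
            continue
        if start <= g_end and end >= g_start:
            return "overlapping"
        if start <= g_end + flank and end >= g_start - flank:
            near = True
    return "proximal" if near else "distal"


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    coding = [("chr1", 1000, 5000)]
    for lnc in [("chr1", 2000, 3000), ("chr1", 5500, 5800), ("chr1", 7000, 8000)]:
        print(lnc, classify_lncrna(lnc, coding))
```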

To investigate the potential effect of such a distinct TSS
methylation pattern on the transcription of lncRNA genes, we
analysed RNA sequencing data from H1 cells and brain frontal
cortex tissue. For this, we downloaded the data from the NCBI
Sequence Read Archive and processed them through the Tophat and
Cufflinks pipeline for RNA-seq analysis. We considered all
transcripts with significant fragments per kilobase of exon model
per million mapped fragments (FPKM) values. Genes with expression
levels more than 1 SD above or below the mean were considered to
be highly or lowly expressed, respectively (Supplementary Table
S1). From this analysis we found 3532 and 4624 highly expressed
protein-coding genes in H1 cells and brain cortical tissue,
respectively, while 1839 and 1415 protein-coding genes were found
to be lowly expressed in H1 cells and brain cortical tissue,
respectively. Similarly, there were 119 and 171 highly expressed
lncRNAs in H1 cells and brain cortical tissue, respectively,
while 2938 and 3665 lncRNAs were found to be lowly expressed in
H1 cells and brain cortical tissue, respectively.

As expected, we found a dip in methylation density at the TSS of
highly expressed protein-coding genes in both H1 cells and brain
cortical tissue (Figure 3A and B). However, this dip in
methylation at the TSS was absent for protein-coding genes that
were lowly expressed in H1 cells and brain cortical tissue
(Figure 3A and B). Interestingly, in these lowly expressed
protein-coding genes we observed high methylation density
immediately downstream of the TSS (Figure 3A and B). Highly
expressed lncRNAs in brain cortical tissue, but not in H1 cells
(119), had lower levels of methylation upstream of the TSS
(~1 kb). However, in both datasets highly expressed lncRNAs
exhibited a sharp increase in methylation density immediately
downstream of the TSS (Figure 3C and D). On the other hand, the
methylation pattern of lowly expressed lncRNAs in H1 cells and
brain cortical tissue was similar to the pattern exhibited by
lowly expressed mRNAs, with a sharp peak immediately downstream
of the TSS (Figure 3C and D). The increased methylation density
immediately downstream of the TSS was thus a feature associated
with lncRNAs (both highly and lowly expressed) and lowly
expressed protein-coding genes. These observations suggest that,
irrespective of their expression status, lncRNAs have elevated
methylation density downstream of their TSS.

Distribution of histone modification marks across TSS
of protein-coding and lncRNA genes

Histone modifications like H3K4me3 and H3K27me3/H3K9me3 are known
to be associated with active and inactive promoters of
protein-coding genes, respectively. Another feature of
protein-coding genes is the presence of the transcription-coupled
chromatin mark H3K36me3 within the gene body of active genes. We
thus examined the distribution of these marks within ~5 kb of the
TSS of lncRNA and protein-coding genes, a window wide enough to
include H3K36me3 marks as well. As mentioned earlier, this
analysis was performed in four different cell types (H1, IMR90,
CD34 and PBMC) and two tissue types (brain germinal matrix and
liver). The pattern of H3K4me3 distribution surrounding the TSS
of lncRNAs was found to be similar to that of protein-coding
genes. However, the density of this mark in lncRNAs was
considerably lower in all cell and tissue types analysed
(Figure 4A-F). The difference in H3K4me3 density between lncRNA
and protein-coding genes was more pronounced in the H1 cell line
than in other cell and tissue types (Figure 4A). Another mark
known to be associated with actively transcribed regions (gene
body exons) is the H3K36me3 modification. We found that the
region downstream of the TSS of protein-coding genes carries
elevated H3K36me3 signals, irrespective of the cell or tissue
type (Figure 5A-F). In contrast, lncRNA genes do not show any
enrichment of the H3K36me3 histone modification across all the
studied cell and tissue types (Figure 5A-F).

Figure 1. Methylation density within promoters, exons and introns was calculated by dividing the methylation peak summit count in that region by
the area of that region. (A) The methylation density in the different bins of protein-coding genes in H1 cells, PBMCs, brain frontal cortex (Fr) and
brain germinal matrix tissue (Gr). (B) The methylation density in the different bins of lncRNA genes in H1 cells, PBMCs, brain frontal cortex (Fr)
and brain germinal matrix tissue (Gr).
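
The density measure given in the Figure 1 caption (summit count in a region class divided by the size of that class) can be sketched as follows. The sketch assumes that "area" means total length in base pairs and that the regions passed in do not overlap one another; the function name `methylation_density` is hypothetical.

```python
def methylation_density(summits, regions):
    """Summit count inside `regions` divided by the total region length in bp.

    summits: list of (chromosome, position).
    regions: list of (chromosome, start, end), half-open [start, end),
             assumed non-overlapping (e.g. all promoters of one gene class).
    """
    total_len = sum(end - start for _, start, end in regions)
    if total_len == 0:
        return 0.0
    count = sum(
        1
        for chrom, pos in summits
        if any(chrom == r_chrom and start <= pos < end for r_chrom, start, end in regions)
    )
    return count / total_len


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    summits = [("chr1", 150), ("chr1", 250), ("chr1", 900), ("chr2", 150)]
    exons = [("chr1", 100, 300), ("chr1", 800, 1000)]
    print(methylation_density(summits, exons))
```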

The repressive mark H3K27me3 showed a tissue-specific
distribution pattern around the TSS, with PBMCs, IMR90 cells,
CD34 cells and liver tissue showing very low levels of H3K27me3
at the TSS of mRNA as well as lncRNA genes (Figure 6C, D, E and
F). In protein-coding genes of brain germinal matrix tissue,
there was a sharp increase in H3K27me3 density around the TSS,
while in H1 cells it was considerably lower than in brain
germinal tissue (Figure 6A and B). We also observed that lncRNA
genes harbour lower levels of the H3K27me3 modification than
protein-coding genes in all the datasets (Figure 6A-F).

In the case of the H3K9me3 modification, which has also been
implicated in heterochromatin formation, we found that the
density of the modification was in general low across the sample
sets studied. In IMR90 and PBMC, the density of this modification
was higher than in the rest (Figure 7A-F). Furthermore, there was
no fixed pattern of distribution of the H3K9me3 modification
among the cell or tissue types in protein-coding and lncRNA
genes.



To further assess the effect of these modifications on
transcription, we analysed the gene expression profiles of H1
cells, since this is the only sample for which expression data
were available. We found that highly expressed protein-coding and
lncRNA genes were more enriched for the H3K4me3 and H3K36me3
modifications than lowly expressed genes (Supplementary Figure
S2A-D). The TSS of highly expressed mRNAs had a very low presence
of the H3K27me3 mark, while the TSS of lowly expressed mRNAs
exhibited markedly elevated levels of H3K27me3 (Supplementary
Figure S2E and F). Similarly, the TSS of lowly expressed lncRNAs
harboured elevated H3K27me3 levels, albeit far lower than their
protein-coding counterparts, and the TSS of highly expressed
lncRNAs exhibited a total absence of the H3K27me3 mark in their
immediate vicinity (Supplementary Figure S2E and F). In the case
of the other repressive histone mark, H3K9me3, highly expressed
protein-coding genes exhibited a fall in H3K9me3 levels
immediately upstream of the TSS (Supplementary Figure S2G and H),
whereas the TSS of lowly expressed protein-coding genes had a
sharp increase in H3K9me3 levels around the TSS. In contrast, in
lncRNA genes the TSS exhibited high levels of H3K9me3
irrespective of expression status (Supplementary Figure S2G and
H).

Figure 2. Methylation pattern around TSS. Distribution of methylation peak summit count in 100-bp continuous windows, 5 kb upstream and
downstream from the start site, was calculated for all protein-coding genes and lncRNA genes in brain frontal cortex (A), brain germinal matrix tissue
(B), H1 cells (C) and PBMCs (D). Counts were normalized by dividing each individual count by the total number of genes in that category. The plots obtained
were further smoothed by taking a moving average of 5.
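
The profile construction described in the Figure 2 caption (summit counts in 100-bp windows over ± 5 kb, divided by the number of genes, then smoothed with a moving average of 5) could be implemented along these lines. Strand-aware offsets and the truncated windows at the profile edges are assumptions, since the caption does not describe either detail; `tss_profile` and `moving_average` are hypothetical names.

```python
def moving_average(values, k):
    """Centred moving average of width k; windows are truncated at the edges."""
    half = k // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def tss_profile(tss_list, summits, flank=5000, bin_size=100, smooth=5):
    """Per-gene summit count in bins around the TSS, smoothed.

    tss_list: non-empty list of (chromosome, TSS position, strand '+' or '-').
    summits: list of (chromosome, summit position).
    Returns a list of 2 * flank // bin_size values; index 0 is the most
    upstream bin. Offsets are flipped for '-' strand genes (an assumption).
    """
    if not tss_list:
        raise ValueError("tss_list must contain at least one gene")
    n_bins = 2 * flank // bin_size
    counts = [0] * n_bins
    for chrom, tss, strand in tss_list:
        for s_chrom, pos in summits:
            if s_chrom != chrom:
                continue
            offset = (pos - tss) if strand == "+" else (tss - pos)
            if -flank <= offset < flank:
                counts[(offset + flank) // bin_size] += 1
    per_gene = [c / len(tss_list) for c in counts]
    return moving_average(per_gene, smooth)


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    genes = [("chr1", 10000, "+"), ("chr1", 30000, "-")]
    peaks = [("chr1", 10050), ("chr1", 29950), ("chr1", 12000)]
    profile = tss_profile(genes, peaks)
    print(len(profile), max(profile))
```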

The lncRNA genes (11 004) downloaded from Gencode v9 comprise
four sub-categories, namely, lincRNAs (5890 genes), antisense
(3588 genes), processed transcripts (1117 genes) and sense
intronic transcripts (409 genes). It is well known that lincRNAs
are marked by H3K4me3 and H3K36me3 modifications that lie outside
mRNA genes. We therefore analysed all four sub-categories of
lncRNAs included in our study individually to ascertain whether
the H3K4me3 and H3K36me3 marks observed were solely due to
lincRNAs. We found that, except for the sense intronic class,
which has very few entries, all other sub-categories have similar
distributions of histone marks around the TSS (Supplementary
Figure S3A and B).

Association of epigenetic marks with CGIs present
around the TSS of lncRNA genes

Several studies have shown that there is a strong correlation
between CGIs and transcription initiation (43). We thus plotted
the CGI density ~5 kb up- and downstream of the TSS of lncRNA
genes to assess whether the promoters of lncRNA genes are also
rich in CGIs. We found that although the CGI density at the TSS
of lncRNA genes is high compared with random regions, it is
considerably lower than the CGI density at the TSS of
protein-coding genes (Figure 8A). Furthermore, CGIs are
frequently associated with H3K4me3 marks, which are themselves a
signature of active promoters (43). Thus, we looked for the
histone modifications associated with CGIs at the TSS of
protein-coding and lncRNA genes. For this purpose we made four
classes, namely, protein-coding genes with or without a CGI and
lncRNA genes with or without a CGI, based on the presence of a
CGI within ± 2 kb of the TSS. After sorting the genes into these
classes, we mapped the locations of H3K4me3, H3K9me3 and H3K27me3
modifications within ± 2 kb of the TSS across all four cell and
two tissue types. The count in each class was normalized to the
total number of entries in that class (Figure 8B and
Supplementary Table S2).
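
The four-class comparison can be sketched as a simple grouped fraction: genes are keyed by biotype and CGI status, and the number of genes carrying a given mark is divided by the class size. The inputs here are precomputed gene sets (for example, genes with a CGI or a mark summit within ± 2 kb of the TSS); the function name and data layout are hypothetical.

```python
def mark_fraction_by_cgi_class(genes, cgi_genes, marked_genes):
    """Fraction of genes carrying a mark in each (biotype, has_CGI) class.

    genes: dict gene id -> biotype, e.g. 'protein_coding' or 'lncRNA'.
    cgi_genes: set of gene ids with a CGI near the TSS.
    marked_genes: set of gene ids with the histone mark near the TSS.
    Returns dict (biotype, has_cgi) -> marked count / class size.
    """
    totals, hits = {}, {}
    for gene, biotype in genes.items():
        key = (biotype, gene in cgi_genes)
        totals[key] = totals.get(key, 0) + 1
        if gene in marked_genes:
            hits[key] = hits.get(key, 0) + 1
    return {key: hits.get(key, 0) / n for key, n in totals.items()}


if __name__ == "__main__":
    # Hypothetical gene sets for illustration only.
    genes = {"p1": "protein_coding", "p2": "protein_coding", "p3": "protein_coding",
             "l1": "lncRNA", "l2": "lncRNA"}
    print(mark_fraction_by_cgi_class(genes, {"p1", "p2", "l1"}, {"p1", "l1", "p3"}))
```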

H3K4me3 marks are enriched in both protein-coding and lncRNA
genes having a CGI, while they are low in similar regions lacking
a CGI, irrespective of the cell or tissue type. The H3K9me3 mark
showed no enrichment in any class in any cell or tissue type.
H3K27me3, on the other hand, showed higher density in brain
germinal matrix tissue in both protein-coding and lncRNA genes
having a CGI, while the rest of the sample set showed no
enrichment with CGIs for either protein-coding or lncRNA genes
(Figure 8B).

Since the start sites of genes are known to be enriched for
various cis-regulatory regions, we looked at the distribution of
regulatory sites present in the ORegAnno database (a database of
regulatory sites available from UCSC) around the TSS of
protein-coding and lncRNA genes. Here also we found that the
start sites of lncRNA genes were enriched for known regulatory
motifs, but to a lesser extent than protein-coding genes
(Supplementary Figure S4).

Figure 3. Association of average methylation density around TSS with gene expression. (A and B) Methylation density around the TSS of
highly and lowly expressed protein-coding genes in brain tissue and the H1 cell line. (C and D) Methylation density around the TSS of highly
and lowly expressed lncRNA genes in brain tissue and the H1 cell line. The peak summit count in each 100-bp continuous window was normalized by dividing
it by the total number of genes in that category. The plots were further smoothed by taking a moving average of 5.

Global analysis of histone marks across TSSs of
protein-coding and lncRNA genes

We mapped the histone distribution ~2 kb up- or downstream of the
TSS of protein-coding and lncRNA genes. The percentage occupancy
of each modification for both classes of genes was calculated by
normalizing each count to the total number of entries in that
category. We found that the overall occupancy of these histone
marks ~2 kb up- or downstream of the TSS of protein-coding genes
in a particular tissue/cell type falls between 65% and 73%, while
in lncRNA genes it ranges from about 27% to 38% (Supplementary
Table S3). Furthermore, when DNA methylation is also taken into
account, the proportion of epigenetically marked protein-coding
genes increases to >75% in H1 cells, PBMCs and brain germinal
matrix tissue. Similarly, for these samples, the proportion of
epigenetically marked lncRNA genes rises to >43% on inclusion of
DNA methylation. Evaluation of the density of individual marks in
this window revealed that >50% of protein-coding genes have the
H3K4me3 mark in all cell/tissue types. In lncRNA genes also, the
occupancy of this mark was high, ~23% across all cell and tissue
types (except PBMC, 17%). Another known transcription-activating
mark, H3K36me3, showed very low occupancy around the TSS (>10%)
in all cell/tissue types, in both protein-coding and lncRNA genes
(Supplementary Table S3). In the case of repressive histone
marks, no general pattern was observed. Instead, there were
variations in promoter occupancy across all cell and tissue types
for both protein-coding and lncRNA genes (Supplementary Table
S3).

We further analysed the possibility of synergy between the
aforesaid histone marks and evaluated the coexistence of two or
more of these histone modifications within 2 kb up- or downstream
of the TSS of mRNA and lncRNA genes (Supplementary File S1).
Among the various combinations, we found that the co-occurrence
of H3K4me3 and H3K27me3 marks, classically known as bivalent
domains, was more prominent than other combinations in all
studied cell and tissue types. In all the cell and tissue types
studied, except brain germinal tissue, the percentage of mRNA
genes having these bivalent marks varies from 1% to 10%, the
lowest being in liver tissue. In lncRNA genes it varies from 0.4%
to 3.2%, with liver being the lowest in this case also. However,
brain germinal matrix tissue exhibits an exceptionally high
percentage of bivalent marks in both mRNA (~41.5%) and lncRNA
(12.6%) genes.

Figure 4. Distribution of H3K4me3 marks across the TSS of protein-coding and lncRNA genes in different cell and tissue types. H3K4me3
distribution ~5 kb up- and downstream of the TSS of protein-coding and lncRNA genes of H1 cells (A), brain germinal matrix tissue (B), IMR90
cells (C), CD34 cells (D), liver tissue (E) and PBMCs (F). Counts were normalized by dividing each individual count by the total number of genes in that
category. The plots obtained were further smoothed by taking a moving average of 5.
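
The occupancy and bivalent percentages reported in this section reduce to set arithmetic once each mark has been assigned to genes. The sketch below assumes such per-mark gene sets are already available; the function names are hypothetical, and the numbers in the usage example are toy values rather than figures from Supplementary Table S3.

```python
def percent(subset, all_genes):
    """Percentage of all_genes that are present in subset."""
    universe = set(all_genes)
    return 100.0 * len(set(subset) & universe) / len(universe)


def bivalent_percent(all_genes, h3k4me3_genes, h3k27me3_genes):
    """Percentage of genes carrying both H3K4me3 and H3K27me3 near the TSS."""
    return percent(set(h3k4me3_genes) & set(h3k27me3_genes), all_genes)


def any_mark_percent(all_genes, genes_by_mark):
    """Percentage of genes carrying at least one of the supplied marks."""
    return percent(set().union(*genes_by_mark.values()), all_genes)


if __name__ == "__main__":
    # Hypothetical gene sets for illustration only.
    genes = ["g1", "g2", "g3", "g4"]
    k4, k27 = {"g1", "g2"}, {"g2", "g3"}
    print(bivalent_percent(genes, k4, k27))
    print(any_mark_percent(genes, {"H3K4me3": k4, "H3K27me3": k27}))
```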

Since this study was based on data generated by various
laboratories, we checked the robustness of the data by mapping
the epigenetic marks around the TSS of a few regions, including
housekeeping genes and some lncRNA genes. A similar pattern of
distribution of these epigenetic modifications across the cell
and tissue types under investigation gave us confidence in the
robustness of the data (Supplemental File S2). Further, the MeDIP
methylation dataset used from the brain cortical region was
validated by the same group using a targeted bisulphite
sequencing method (44).



DISCUSSION 

Recent advances in high-throughput sequencing technologies have
revealed that >90% of the human genome is transcribed, of which
only 1-2% accounts directly for protein synthesis (45). It is
increasingly evident, in humans and other organisms, that the
transcriptome is significantly more complex than previously
supposed, with RNA having a much broader influence over the
manifested phenotype than implied solely by its role as
messenger. Epigenetic mechanisms like cytosine methylation and
histone modifications are known to influence gene expression.
While aberrations of the epigenome have been found to be
associated with several human diseases and disorders, there have
been increasing reports associating aberrant lncRNA expression
with cancer, cardiovascular disorders and other maladies (46,47).
However, the association of epigenomic features like cytosine
methylation and histone modifications with lncRNA genes has not
been studied at the genome-wide level.

In the present report we have tried to draw a global picture of
epigenetic marks across lncRNA loci in humans. The epigenetic
marks studied here include histone modifications and DNA
methylation, which have recently been studied extensively in
relation to the regulation of protein-coding genes. We performed
a comprehensive analysis of DNA methylation, H3K27me3 and H3K9me3
as representative repressive marks, which are known to be
associated with chromatin repression, and of H3K4me3 and H3K36me3
as representative expression-associated marks. The complete raw
datasets covering the transcription-repressive and -activating
marks were obtained from the NCBI repository. Datasets that are
still under embargo could not be included in the analysis
(Supplementary Table S4). In addition, we have not included
datasets from in vitro differentiated, stem-cell-derived and
transformed cell types, since they are likely to have altered
epigenetic profiles (48). Of the remaining cell types, we chose
H1 as a representative pluripotent embryonic stem cell, primary
CD34+ cells as representative multipotent haematopoietic cells,
and IMR90 (foetal lung fibroblast) and PBMC as representative
differentiated cell types. In addition, we chose two tissue
types, brain and liver, which represent two organs having
distinct physiological roles and germinal origin (brain being
ectodermic and liver mesoendodermic). Similarities in epigenetic
signatures between these two tissues should reflect the global
schema for the distribution of epigenetic marks. Thus, our study,
involving such disparate cases of cell fate and identity, allowed
us to derive conclusions regarding the distribution of epigenetic
marks in general, regardless of cellular differentiation status.

DNA methylation is an important evolutionarily conserved
epigenetic mark (49). It is known that the TSS of expressed
protein-coding genes is hypomethylated, and our results are in
agreement with earlier observations that the methylation density
of highly expressed protein-coding genes is lowest at their TSS
and remains low even downstream of the TSS (39). In contrast to
the methylation pattern around the TSS of highly expressed
protein-coding genes, our results indicate that in lowly
expressed protein-coding genes the methylation density shows an
upward trend from the TSS and is highest immediately downstream
of the TSS, in the region of the first exon. This is consistent
with earlier studies showing that DNA methylation in the region
immediately downstream of the TSS, i.e. in the first exon, is
much more tightly linked to gene silencing than promoter
methylation (50). However, in lncRNAs the methylation density is
high in the region downstream of the TSS irrespective of
expression level. Thus, unlike in protein-coding genes,
methylation downstream of the TSS (in the first exon) is not a
feature of lncRNA silencing, suggesting that other factors might
also be associated with lncRNA gene regulation.
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Another evolutionarily conserved feature of TSS of 
protein-coding genes is their association with CGI (43). 
About half of all CGIs contain TSSs of annotated 
protein-coding genes (43). The others are classified as 
'orphan' CGIs. The purpose of such orphan CGIs is 
poorly understood (51). Several genome-wide Pol II 
mapping studies have revealed that a majority of these 
sites are also transcription initiation sites. Some
lncRNAs, such as Air and Kcnq1ot1, have also been shown to
be initiated from such 'orphan' CGIs present in introns of
the Igf2r and Kcnq1 genes, respectively (52-54). In our
analysis we found an overlap of CGIs with the TSS in
~24% of lncRNA genes, and by extension we suggest
that such orphan CGIs might be the transcription
initiation sites of other ncRNAs as well. CGI distribution
within the genome is often concurrent with the H3K4me3
mark (55,56). It is a well-accepted paradigm that DNA
methylation corresponds to repressive chromatin while
H3K4me3 is associated with transcriptionally active
chromatin (57,58). Our analysis shows that the occurrence
of H3K4me3 marks in mRNA and lncRNA genes
was higher when a CGI was present and lower
in the absence of a CGI. This suggests that CGIs of
lncRNA genes are also marked by H3K4me3. However, when
we looked at the association of repressive histone marks 
H3K27me3 and H3K9me3 with CGI present at the 
lncRNA and protein-coding genes, we did not find any
relationship with the exception of brain germinal matrix
tissue (in H3K27me3 class). We also found that ~40% of the
TSSs of protein-coding genes and ~12% of the TSSs of lncRNA
genes in brain germinal matrix tissue carried both
H3K4me3 and H3K27me3 marks.
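
As an illustration of how a TSS-CGI overlap percentage of this kind can be computed, the following is a minimal sketch and not the pipeline used in this study. It assumes that a TSS counts as overlapping when its single coordinate falls inside a CGI interval, that coordinates are 0-based and half-open, and that CGIs do not overlap each other; the function names and the toy coordinates are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(cgis):
    """Group CGI intervals by chromosome and sort them by start.

    `cgis` is an iterable of (chrom, start, end) tuples (0-based, half-open).
    Assumption: intervals on the same chromosome do not overlap.
    """
    index = defaultdict(list)
    for chrom, start, end in cgis:
        index[chrom].append((start, end))
    for chrom in index:
        index[chrom].sort()
    return index


def tss_in_cgi(index, chrom, pos):
    """Return True if position `pos` on `chrom` lies inside a CGI."""
    intervals = index.get(chrom, [])
    # Locate the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


def fraction_overlapping(index, tss_list):
    """Fraction of (chrom, pos) TSSs that fall inside any CGI."""
    if not tss_list:
        return 0.0
    hits = sum(tss_in_cgi(index, chrom, pos) for chrom, pos in tss_list)
    return hits / len(tss_list)


# Toy data (hypothetical coordinates, for illustration only).
toy_cgis = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
toy_tss = [("chr1", 150), ("chr1", 300), ("chr2", 79), ("chr2", 80)]
toy_index = build_index(toy_cgis)
print(fraction_overlapping(toy_index, toy_tss))
```

In practice the overlap criterion (exact TSS position versus a window around the TSS) changes the resulting percentage, so the criterion should be stated alongside any reported figure.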

We also analysed histone modifications associated with
active (H3K4me3 and H3K36me3) and repressed
(H3K9me3 and H3K27me3) chromatin. The distribution
of H3K4me3 across cell and tissue types was similar for
protein-coding and lncRNA genes, with
increased density at the TSS. Furthermore, presence of
H3K4me3 and H3K36me3 modifications in the TSS and
gene body, respectively, corresponded to higher expression
of both protein-coding and lncRNA genes. This suggests
that, unlike the repressive methylation marks, the presence of
these transcription-activating marks could better explain
the regulation of lncRNA expression.

H3K27me3 seems to play similar roles in the expression
of lncRNA and mRNA, as the highly expressed
transcripts of both classes seem to lack this mark at their
TSS, in contrast to the higher occupancy of this repressive
mark in the lowly expressed transcripts. This is consistent
with a previous report suggesting that lncRNAs that are
expressed at lower levels have higher H3K27me3 at their
promoters. However, unlike H3K27me3, the repressive
mark H3K9me3 does not seem to dictate repression
in the lncRNA class, as highly expressed lncRNAs also carried
this mark at their TSS, in contrast to protein-coding
genes, whose expression was inversely correlated with the
presence of this repressive mark.

Furthermore, H3K4me3 and H3K27me3 are known to 
co-occupy certain genomic regions known as bivalent 
domains, which are associated with the promoters of 
lineage regulatory genes. We observed that occurrence of 
these bivalent marks (H3K4me3 and H3K27me3) was 
maximum in brain germinal matrix tissue: 41% in 
mRNA genes and 12% in lncRNA genes. Brain
germinal matrix tissue is a proliferative centre that is a
source of neurons and glial cells. In all other datasets
analysed, the occupancy was between 1% and 10% for
mRNA genes and 0.4-3.2% for lncRNA genes. H1 embryonic
stem cells had 8.8% of mRNA genes and 3.2% of
lncRNA genes occupied by bivalent marks. It is well
known that lineage-related genes have bivalent marks in 
pluripotent stem cells. Such bivalent marks are
generally believed to silence (via H3K27me3) developmental
lineage-specific genes while poising them
for subsequent activation via H3K4me3 during differentiation.
However, a recent study by Gobbi et al.
suggests that the relation between the presence of 
bivalent marks in genes and their subsequent expression 
during differentiation may be oversimplistic (59). They 
found that genes that have bivalent marks in pluripotent 
and multipotent cells may be expressed at low levels 
during lineage priming. However, further studies are 
necessary to understand the implications of these 
bivalent marks in regulation of lineage-specific genes. 
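
The bivalent occupancy percentages discussed above are, in essence, the share of genes whose TSS carries both marks. The following minimal sketch shows that calculation under the assumption that each mark has already been reduced to a set of marked gene identifiers; the function name and the toy gene identifiers are hypothetical and do not come from the datasets analysed here.

```python
def bivalent_fraction(genes, k4me3_marked, k27me3_marked):
    """Fraction of `genes` whose TSS carries both H3K4me3 and H3K27me3.

    All three arguments are iterables of gene identifiers. How a TSS is
    called "marked" (peak overlap, window size) is decided upstream and
    is an assumption of this sketch.
    """
    genes = set(genes)
    if not genes:
        return 0.0
    bivalent = genes & set(k4me3_marked) & set(k27me3_marked)
    return len(bivalent) / len(genes)


# Toy example with hypothetical gene identifiers.
toy_genes = ["g1", "g2", "g3", "g4", "g5"]
toy_k4 = {"g1", "g2", "g3"}
toy_k27 = {"g2", "g3", "g4"}
print(bivalent_fraction(toy_genes, toy_k4, toy_k27))
```

Computing the fraction separately for the mRNA and lncRNA gene sets gives the per-class values that can then be compared across cell and tissue types.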

Epigenetic marks like DNA methylation and histone
modifications regulate the expression of the genetic message
and therefore determine cellular, and hence organismal,
identity. LncRNAs are also involved in the manifestation
of cellular identity; however, the epigenetic marks governing
their expression are not well characterized. We have found
that a large proportion of lncRNA genes lack any of the
aforesaid epigenetic marks. Where present, however, these marks
show a distribution pattern akin to that of protein-coding
genes, with the exception of DNA methylation. Moreover,
the distribution pattern of epigenetic features does not
differ significantly between stem cells, differentiated cells and
the tissues used, which indicates that the general behaviour
of these processes remains unchanged regardless of differentiation
and proliferative status.

Thus, our observations show that the DNA methylation
pattern in the immediate vicinity of the TSS is remarkably
dissimilar for lncRNA and protein-coding genes.
Furthermore, the histone marks H3K4me3,
H3K36me3 and H3K27me3 correlate with the expression
of lncRNA in a manner similar to that of mRNA.
However, the repressive marks DNA methylation and
H3K9me3 do not seem to be involved in
the expression of lncRNAs.